This challenge uses data from NYC Open Data. I used the NYC Restaurant Inspection dataset (https://data.cityofnewyork.us/Health/restaurant-data-set-2/f6tk-2b7a), which contains information on restaurant health code violations in New York City. This challenge explores the relationship (if any) between restaurant cuisine type and current health code grade.
library(tidyverse)
## -- Attaching packages ---------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.5
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## -- Conflicts ------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(reshape)
## Warning: package 'reshape' was built under R version 3.5.1
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 3.5.1
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.1
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
file_data <-
"C:/Users/Claire/Downloads/DOHMH_New_York_City_Restaurant_Inspection_Results.csv"
df_raw <- file_data %>% read_csv()
## Parsed with column specification:
## cols(
## CAMIS = col_integer(),
## DBA = col_character(),
## BORO = col_character(),
## BUILDING = col_character(),
## STREET = col_character(),
## ZIPCODE = col_character(),
## PHONE = col_character(),
## `CUISINE DESCRIPTION` = col_character(),
## `INSPECTION DATE` = col_character(),
## ACTION = col_character(),
## `VIOLATION CODE` = col_character(),
## `VIOLATION DESCRIPTION` = col_character(),
## `CRITICAL FLAG` = col_character(),
## SCORE = col_integer(),
## GRADE = col_character(),
## `GRADE DATE` = col_character(),
## `RECORD DATE` = col_character(),
## `INSPECTION TYPE` = col_character()
## )
# Clean up dataset's variable names to remove spaces and make lowercase
df <-
df_raw %>%
dplyr::rename_all(make.names) %>%
rename_all(tolower)
# Explore cuisine type labels
df %>%
count(cuisine.description)
There are 85 different types of cuisine, which is far too many to plot. We will consolidate these cuisine types (e.g. combining "French", "German", "Portuguese", etc. into one large "Western European" category). I have created a new data frame below, called df_cuisine_labels, that shows how each cuisine description is categorized.
# Create new labels in a dataset called df_cuisine_labels
df_cuisine_labels <-
tribble(
~orig_label, ~new_label,
"Middle Eastern", "Middle Eastern",
"Afghan", "Middle Eastern",
"Armenian", "Middle Eastern",
"Iranian", "Middle Eastern",
"Moroccan", "Middle Eastern",
"Egyptian", "Middle Eastern",
"Turkish", "Middle Eastern",
"South Asian", "South Asian",
"Pakistani", "South Asian",
"Indian", "South Asian",
"Bangladeshi", "South Asian",
"African", "African",
"Ethiopian", "African",
"Californian", "American",
"Tex-Mex", "American",
"Southwestern", "American",
"Continental", "American",
"Soul Food", "American",
"Chicken", "American",
"Barbecue", "American",
"American", "American",
"Thai", "Asian",
"Filipino", "Asian",
"Vietnamese/Cambodian/Malaysia", "Asian",
"Asian", "Asian",
"Indonesian", "Asian",
"Chinese", "Asian",
"Chinese/Cuban", "Asian",
"Chinese/Japanese", "Asian",
"Japanese", "Asian",
"Korean", "Asian",
"Australian", "Australian",
"Bakery", "Bakery",
"Bagels/Pretzels", "Bakery",
"Café/Coffee/Tea", "Beverages",
"Bottled beverages, including water, sodas, juices, etc.", "Beverages",
"Caribbean", "Islands",
"Hawaiian", "Islands",
"Polynesian", "Islands",
"Creole/Cajun", "Creole/Cajun",
"Creole", "Creole/Cajun",
"Cajun", "Creole/Cajun",
"Czech", "Eastern European",
"Polish", "Eastern European",
"Russian", "Eastern European",
"Eastern European", "Eastern European",
"Portuguese", "Western European",
"English", "Western European",
"Scandinavian", "Western European",
"Spanish", "Western European",
"Tapas", "Western European",
"Basque", "Western European",
"French", "Western European",
"German", "Western European",
"Irish", "Western European",
"Pizza/Italian", "Western European",
"Italian", "Western European",
"Pizza", "Fast Food",
"Hamburgers", "Fast Food",
"Hotdogs", "Fast Food",
"Hotdogs/Pretzels", "Fast Food",
"Fruits/Vegetables", "Healthy Food",
"Vegetarian", "Healthy Food",
"Juice, Smoothies, Fruit Salads", "Healthy Food",
"Pancakes/Waffles", "Diner Food",
"Steak", "Diner Food",
"Soups", "Soups/Salads/Sandwiches",
"Soups & Sandwiches", "Soups/Salads/Sandwiches",
"Salads", "Soups/Salads/Sandwiches",
"Sandwiches/Salads/Mixed Buffet", "Soups/Salads/Sandwiches",
"Sandwiches", "Soups/Salads/Sandwiches",
"Delicatessen", "Soups/Salads/Sandwiches",
"Seafood", "Seafood",
"Ice Cream, Gelato, Yogurt, Ices", "Desserts",
"Donuts", "Desserts",
"Nuts/Confectionary", "Desserts",
"Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Latin",
"Peruvian", "Latin",
"Mexican", "Latin",
"Chilean", "Latin",
"Brazilian", "Latin",
"Mediterranean", "Mediterranean",
"Greek", "Mediterranean",
"Jewish/Kosher", "Mediterranean",
"Not Listed/Not Applicable", "NA/Other",
"Other", "NA/Other"
)
# Merge new labels into original dataset
df_new <-
df %>%
left_join(df_cuisine_labels, by = c("cuisine.description" = "orig_label"))
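Before moving on, it is worth verifying that every cuisine description found a match in df_cuisine_labels, since an unmatched label would silently become an NA in new_label after the left join. A minimal sketch of the check using anti_join, shown on toy tibbles standing in for df and df_cuisine_labels:

```r
library(dplyr)

# Toy stand-ins for df and df_cuisine_labels
toy_df <- tibble(cuisine.description = c("French", "Thai", "Klingon"))
toy_labels <- tibble(
  orig_label = c("French", "Thai"),
  new_label  = c("Western European", "Asian")
)

# Rows in toy_df with no matching orig_label -- these would get NA new_label
unmatched <-
  toy_df %>%
  anti_join(toy_labels, by = c("cuisine.description" = "orig_label"))

unmatched
```

Running the same anti_join on the real df would list any cuisine descriptions still needing a row in the lookup table.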
# Manipulate data to calculate share of each grade, by cuisine type
## Assumption: Focus only on observations with a current A/B/C grade
df.1 <-
df_new %>%
filter(grade %in% c("A", "B", "C")) %>%
group_by(new_label, grade) %>%
summarise(count = n()) %>%
mutate(perc = count / sum(count))
# Plot data in stacked percent bar chart
df.1 %>%
ggplot(mapping = aes(x = new_label, y = perc * 100, fill = grade)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_text(
aes(label = paste0(round(perc * 100, digits = 0),"%")),
position = position_stack(vjust = 0.5),
size = 2
) +
labs(
title = "Current Grades by Cuisine Type",
y = "Percent",
x = "Cuisine Type",
fill = "Grade"
)
From this chart we see that, among restaurants with a current grade, dessert businesses have the highest share of 'A' grades, followed by beverage businesses (e.g. coffee and tea shops) and diner food (e.g. steak, pancakes and waffles).
On the flip side, African and Creole/Cajun businesses have the lowest shares of 'A' grades, and Creole/Cajun has the highest share of 'C' grades within its cuisine type.
Now we will look at which health codes are most commonly violated.
df_new %>%
count(violation.code) %>%
arrange(-n)
Doing an initial count, we see that violation codes 10F, 08A, 04L, 06D, and 06C are the top violations. I looked up the description behind the top 10 code violations.
# Browse what the violation codes mean
df_new %>%
filter(violation.code == "10F") %>%
distinct(violation.description)
Further exploration of the dataset shows that:

* 10F: non-food contact surface improperly constructed.
* 08A: facility is not vermin-proof.
* 04L: evidence of mice or live mice present.
* 06D: food contact surface not properly washed, rinsed, and sanitized.
* 06C: food not protected from potential source of contamination.
* 02G: cold food item held above 41°F.
* 10B: plumbing improperly installed or maintained.
* 04N: filth flies or food/refuse/sewage-associated flies present.
* 02B: hot food item not held at or above 140°F.
* 04H: raw, cooked, or prepared food is adulterated or contaminated.
Now we will plot the distribution of code violations. For visualization purposes, we will only plot the top 10 violations.
# Create dataset
## Includes counting each violation code, grabbing the top 10,
## and relabeling the numerical code with a description
df.2 <-
df_new %>%
count(violation.code) %>%
arrange(-n) %>%
mutate(
violation.code = recode(
violation.code,
`10F` = "10F: Improper non-food\ncontact surface",
`08A` = "08A: Not vermin-proof",
`04L` = "04L: Mice present",
`06D` = "06D: Food surface\nimproperly cleaned",
`06C` = "06C: Potential food\ncontamination",
`02G` = "02G: Improper cold\nfood temperature",
`10B` = "10B: Improper plumbing",
`04N` = "04N: Flies present",
`02B` = "02B: Improper hot\nfood temperature",
`04H` = "04H: Contaminated food"
)
) %>%
top_n(n = 10)
## Selecting by n
# Plot data
df.2 %>%
ggplot(
mapping = aes(
x = fct_reorder(violation.code, n, .desc = TRUE),
y = n
)
) +
geom_col(fill = "lightblue4") +
labs(
title = "Top 10 Health Code Violations",
x = "Health Code",
y = "Count"
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
The most frequently violated health code is improper non-food contact surface, followed by failure to be vermin-proof or the presence of mice.
Now we are interested in exploring if there is a statistically significant relationship between type of cuisine and restaurant grade. To test this, we will perform an ordinal logistic regression.
# Focus only on observations with a current grade
df_filtered <- df_new %>% filter(grade %in% c("A", "B", "C"))
Our outcome variable, grade, has three levels: "A", "B", and "C". The test we will use to assess significance between cuisine type and grade is an ordinal logistic regression. In this regression, we will specify Hess = TRUE so that the model returns the observed information matrix from the optimization, which we will use to obtain standard errors.
In order to run an ordinal logistic regression, we will drop some variables that are likely extraneous (e.g. camis, building) or have too many unique values (e.g. zipcode). We will call this pared down dataset df.3.
# Pare down data
df.3 <-
df_filtered %>%
drop_na() %>%
dplyr::select(
-c(
camis, dba, building, street, phone, cuisine.description,
inspection.date, action, violation.description, grade.date,
record.date, inspection.type, zipcode, violation.code
)
) %>%
mutate(
grade = as.factor(ifelse(grade == "A", 3, ifelse(grade == "B", 2, 1)))
)
# Fit ordinal logistic model
fit.logit <- polr(grade ~ ., data = df.3, Hess = TRUE)
# Print summary of results
summary(fit.logit)
## Call:
## polr(formula = grade ~ ., data = df.3, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## boroBROOKLYN 0.24401 4.666e-02 5.230e+00
## boroMANHATTAN 0.29755 4.469e-02 6.658e+00
## boroMissing 7.20064 8.145e-07 8.841e+06
## boroQUEENS 0.28501 4.762e-02 5.985e+00
## boroSTATEN ISLAND 0.37883 8.083e-02 4.687e+00
## critical.flagNot Critical 0.16224 2.609e-02 6.218e+00
## score -0.84796 5.435e-03 -1.560e+02
## new_labelAmerican 0.69255 1.626e-01 4.259e+00
## new_labelAsian 0.51805 1.630e-01 3.179e+00
## new_labelAustralian 1.28690 5.383e-01 2.391e+00
## new_labelBakery 0.54580 1.729e-01 3.158e+00
## new_labelBeverages 0.85035 1.752e-01 4.853e+00
## new_labelCreole/Cajun -0.08624 3.137e-01 -2.749e-01
## new_labelDesserts 0.94327 1.901e-01 4.962e+00
## new_labelDiner Food 1.60090 3.406e-01 4.701e+00
## new_labelEastern European 0.29721 2.150e-01 1.382e+00
## new_labelFast Food 0.75953 1.697e-01 4.476e+00
## new_labelHealthy Food 0.52463 1.934e-01 2.712e+00
## new_labelIslands 0.46452 1.721e-01 2.698e+00
## new_labelLatin 0.45131 1.652e-01 2.732e+00
## new_labelMediterranean 0.62004 1.757e-01 3.530e+00
## new_labelMiddle Eastern 0.47805 1.943e-01 2.461e+00
## new_labelNA/Other 2.28848 4.381e-01 5.223e+00
## new_labelSeafood 0.81354 2.318e-01 3.509e+00
## new_labelSoups/Salads/Sandwiches 0.71757 1.742e-01 4.120e+00
## new_labelSouth Asian 0.01444 1.774e-01 8.137e-02
## new_labelWestern European 0.66172 1.647e-01 4.019e+00
##
## Intercepts:
## Value Std. Error t value
## 1|2 -22.7276 0.2198 -103.4034
## 2|3 -12.5536 0.1792 -70.0674
##
## Residual Deviance: 47950.09
## AIC: 48008.09
By default, the ordinal logistic regression summary reports no significance test (p-value). It does report the t-statistic, which is the coefficient divided by its standard error, so we will use this t-statistic to construct a p-value by comparing it against the standard normal distribution. This is a reasonable approximation given our large number of observations.
# Store coefficients in a table
coef_table <- coef(summary(fit.logit))
# Calculate p-values
p <- round(pnorm(abs(coef_table[ , "t value"]), lower.tail = FALSE) * 2, 3)
# Combine into one table
table <- cbind(coef_table, "p value" = p)
table
## Value Std. Error t value
## boroBROOKLYN 0.24401235 4.665781e-02 5.229829e+00
## boroMANHATTAN 0.29754840 4.469364e-02 6.657510e+00
## boroMissing 7.20064108 8.144938e-07 8.840634e+06
## boroQUEENS 0.28500989 4.761888e-02 5.985228e+00
## boroSTATEN ISLAND 0.37882602 8.083274e-02 4.686542e+00
## critical.flagNot Critical 0.16223601 2.609319e-02 6.217561e+00
## score -0.84796449 5.435067e-03 -1.560173e+02
## new_labelAmerican 0.69254956 1.626266e-01 4.258525e+00
## new_labelAsian 0.51805004 1.629666e-01 3.178873e+00
## new_labelAustralian 1.28689764 5.382747e-01 2.390782e+00
## new_labelBakery 0.54580165 1.728537e-01 3.157592e+00
## new_labelBeverages 0.85034506 1.752191e-01 4.853040e+00
## new_labelCreole/Cajun -0.08623749 3.137071e-01 -2.748981e-01
## new_labelDesserts 0.94326735 1.900996e-01 4.961964e+00
## new_labelDiner Food 1.60089829 3.405587e-01 4.700800e+00
## new_labelEastern European 0.29720593 2.150016e-01 1.382343e+00
## new_labelFast Food 0.75953295 1.696804e-01 4.476255e+00
## new_labelHealthy Food 0.52462994 1.934162e-01 2.712440e+00
## new_labelIslands 0.46451736 1.721407e-01 2.698474e+00
## new_labelLatin 0.45130938 1.651836e-01 2.732168e+00
## new_labelMediterranean 0.62003799 1.756517e-01 3.529929e+00
## new_labelMiddle Eastern 0.47804701 1.942577e-01 2.460891e+00
## new_labelNA/Other 2.28847793 4.381132e-01 5.223485e+00
## new_labelSeafood 0.81354186 2.318155e-01 3.509436e+00
## new_labelSoups/Salads/Sandwiches 0.71757435 1.741505e-01 4.120426e+00
## new_labelSouth Asian 0.01443798 1.774447e-01 8.136610e-02
## new_labelWestern European 0.66172363 1.646603e-01 4.018720e+00
## 1|2 -22.72758417 2.197953e-01 -1.034034e+02
## 2|3 -12.55358765 1.791643e-01 -7.006745e+01
## p value
## boroBROOKLYN 0.000
## boroMANHATTAN 0.000
## boroMissing 0.000
## boroQUEENS 0.000
## boroSTATEN ISLAND 0.000
## critical.flagNot Critical 0.000
## score 0.000
## new_labelAmerican 0.000
## new_labelAsian 0.001
## new_labelAustralian 0.017
## new_labelBakery 0.002
## new_labelBeverages 0.000
## new_labelCreole/Cajun 0.783
## new_labelDesserts 0.000
## new_labelDiner Food 0.000
## new_labelEastern European 0.167
## new_labelFast Food 0.000
## new_labelHealthy Food 0.007
## new_labelIslands 0.007
## new_labelLatin 0.006
## new_labelMediterranean 0.000
## new_labelMiddle Eastern 0.014
## new_labelNA/Other 0.000
## new_labelSeafood 0.000
## new_labelSoups/Salads/Sandwiches 0.000
## new_labelSouth Asian 0.935
## new_labelWestern European 0.000
## 1|2 0.000
## 2|3 0.000
We conducted many hypothesis tests at once since we had so many cuisine types. To adjust for this and to decrease our chance of incorrectly rejecting the null hypothesis, I will apply a Bonferroni correction. This correction imposes a stricter p-value requirement that depends on the number of tests performed. Here we ran 27 tests, so for a coefficient to be considered statistically significant at the 5% level, its p-value must be below 0.05 / 27 ≈ 0.00185.
# Add the Bonferroni-corrected significance threshold as a column
complete_table <- cbind(table, "adj p value" = 0.00185)
complete_table
## Value Std. Error t value
## boroBROOKLYN 0.24401235 4.665781e-02 5.229829e+00
## boroMANHATTAN 0.29754840 4.469364e-02 6.657510e+00
## boroMissing 7.20064108 8.144938e-07 8.840634e+06
## boroQUEENS 0.28500989 4.761888e-02 5.985228e+00
## boroSTATEN ISLAND 0.37882602 8.083274e-02 4.686542e+00
## critical.flagNot Critical 0.16223601 2.609319e-02 6.217561e+00
## score -0.84796449 5.435067e-03 -1.560173e+02
## new_labelAmerican 0.69254956 1.626266e-01 4.258525e+00
## new_labelAsian 0.51805004 1.629666e-01 3.178873e+00
## new_labelAustralian 1.28689764 5.382747e-01 2.390782e+00
## new_labelBakery 0.54580165 1.728537e-01 3.157592e+00
## new_labelBeverages 0.85034506 1.752191e-01 4.853040e+00
## new_labelCreole/Cajun -0.08623749 3.137071e-01 -2.748981e-01
## new_labelDesserts 0.94326735 1.900996e-01 4.961964e+00
## new_labelDiner Food 1.60089829 3.405587e-01 4.700800e+00
## new_labelEastern European 0.29720593 2.150016e-01 1.382343e+00
## new_labelFast Food 0.75953295 1.696804e-01 4.476255e+00
## new_labelHealthy Food 0.52462994 1.934162e-01 2.712440e+00
## new_labelIslands 0.46451736 1.721407e-01 2.698474e+00
## new_labelLatin 0.45130938 1.651836e-01 2.732168e+00
## new_labelMediterranean 0.62003799 1.756517e-01 3.529929e+00
## new_labelMiddle Eastern 0.47804701 1.942577e-01 2.460891e+00
## new_labelNA/Other 2.28847793 4.381132e-01 5.223485e+00
## new_labelSeafood 0.81354186 2.318155e-01 3.509436e+00
## new_labelSoups/Salads/Sandwiches 0.71757435 1.741505e-01 4.120426e+00
## new_labelSouth Asian 0.01443798 1.774447e-01 8.136610e-02
## new_labelWestern European 0.66172363 1.646603e-01 4.018720e+00
## 1|2 -22.72758417 2.197953e-01 -1.034034e+02
## 2|3 -12.55358765 1.791643e-01 -7.006745e+01
## p value adj p value
## boroBROOKLYN 0.000 0.00185
## boroMANHATTAN 0.000 0.00185
## boroMissing 0.000 0.00185
## boroQUEENS 0.000 0.00185
## boroSTATEN ISLAND 0.000 0.00185
## critical.flagNot Critical 0.000 0.00185
## score 0.000 0.00185
## new_labelAmerican 0.000 0.00185
## new_labelAsian 0.001 0.00185
## new_labelAustralian 0.017 0.00185
## new_labelBakery 0.002 0.00185
## new_labelBeverages 0.000 0.00185
## new_labelCreole/Cajun 0.783 0.00185
## new_labelDesserts 0.000 0.00185
## new_labelDiner Food 0.000 0.00185
## new_labelEastern European 0.167 0.00185
## new_labelFast Food 0.000 0.00185
## new_labelHealthy Food 0.007 0.00185
## new_labelIslands 0.007 0.00185
## new_labelLatin 0.006 0.00185
## new_labelMediterranean 0.000 0.00185
## new_labelMiddle Eastern 0.014 0.00185
## new_labelNA/Other 0.000 0.00185
## new_labelSeafood 0.000 0.00185
## new_labelSoups/Salads/Sandwiches 0.000 0.00185
## new_labelSouth Asian 0.935 0.00185
## new_labelWestern European 0.000 0.00185
## 1|2 0.000 0.00185
## 2|3 0.000 0.00185
We see in the adjusted table that all of the boroughs are significant, as well as the following cuisine types (p-value below the 0.00185 threshold): American, Asian, Beverages, Desserts, Diner Food, Fast Food, Mediterranean, NA/Other, Seafood, Soups/Salads/Sandwiches, and Western European.
In other words, there is a statistically significant relationship between cuisine type and restaurant grade for some, but not all, cuisine types.
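Note that the "adj p value" column above is really just the Bonferroni threshold (0.05 / 27) repeated on every row, not an adjusted p-value. An equivalent and arguably cleaner route is base R's p.adjust, which scales the p-values themselves by the number of tests so they can be compared against the usual 0.05. A sketch using a few of the raw p-values from the table:

```r
# Bonferroni-adjust a few of the raw p-values from the table,
# telling p.adjust that 27 tests were run in total
raw_p <- c(asian = 0.001, bakery = 0.002, australian = 0.017)
adj_p <- p.adjust(raw_p, method = "bonferroni", n = 27)
adj_p
```

Comparing adj_p against 0.05 is equivalent to comparing the raw p-values against 0.05 / 27: Asian (0.027) remains significant, while Bakery (0.054) and Australian (0.459) do not.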
The resulting coefficients are log odds; exponentiating them yields proportional odds ratios. As an example of how to interpret them, the "American" cuisine label has an estimated coefficient of 0.69255. Since grade was coded 1 = C, 2 = B, 3 = A, a positive coefficient means that, holding all other variables constant, a restaurant serving American cuisine has higher odds of a better grade: exp(0.69255) ≈ 2.0, i.e. roughly double the odds of an "A" versus a "B" or "C". All of the statistically significant cuisine types in the list above had positive coefficients, so these cuisines are associated with higher odds of having a grade A.
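The conversion from log odds to an odds ratio is a one-liner, shown here for the "American" estimate taken from the summary table above:

```r
# Convert the log-odds coefficient for "American" into an odds ratio
american_coef <- 0.69255  # value from the summary table above
exp(american_coef)
```

The same transformation, exp(coef(fit.logit)), would convert the full coefficient vector at once.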
The model also estimated cutpoints between the "C" and "B" grades (1|2) and between the "B" and "A" grades (2|3). These estimates indicate where the latent variable is cut to produce the three grade classes observed in the data. They serve as thresholds and are generally not used for interpretation or inference.
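To make the cutpoints concrete: polr models logit P(grade ≤ j) = ζ_j − η, where η is the linear predictor. The sketch below plugs in the printed estimates for a hypothetical Manhattan restaurant serving American cuisine with an inspection score of 12 and the critical flag at its reference level (this exact restaurant is an assumption for illustration, not a row from the data):

```r
# Cutpoints from the summary above (grades coded 1 = C, 2 = B, 3 = A)
zeta_12 <- -22.7276   # C | B cutpoint
zeta_23 <- -12.5536   # B | A cutpoint

# Linear predictor for a hypothetical restaurant:
# Manhattan + American cuisine + score of 12
eta <- 0.29755 + 0.69255 + (-0.84796 * 12)

# polr parameterization: logit P(grade <= j) = zeta_j - eta
p_C      <- plogis(zeta_12 - eta)   # P(grade = C)
p_C_or_B <- plogis(zeta_23 - eta)   # P(grade = B or C)
p_A      <- 1 - p_C_or_B            # P(grade = A)
round(c(C = p_C, B = p_C_or_B - p_C, A = p_A), 3)
```

For this restaurant an "A" is by far the most likely grade, which matches intuition: a score of 12 or below earns an A under the DOHMH grading rules.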
Based on the data, to help the DOHMH prioritize inspections, I suggest that restaurants in the categories above be deprioritized in its visits. These restaurant types are more likely to have higher grades and, by definition, fewer violations. Combining this with the exploratory data analysis done in Question 1, I would recommend that the DOHMH prioritize African, Creole/Cajun, and South Asian restaurants in its inspections.
Aside from prioritizing certain types of restaurants, the DOHMH could also proactively address the most common types of violations. It’s likely that different types of restaurants commit different types of violations. For example, I imagine an Indian restaurant would commit different health code violations than a candy store (see below):
# Violations by Cuisine Type
df_filtered %>%
filter(new_label == "South Asian") %>%
count(violation.code) %>%
arrange(-n) %>%
top_n(n = 5)
## Selecting by n
df_filtered %>%
filter(new_label == "Desserts") %>%
count(violation.code) %>%
arrange(-n) %>%
top_n(n = 5)
## Selecting by n
One action the department can take is to create pamphlets for each cuisine type that lists the top 5 most commonly violated health codes for that restaurant category. The department can then distribute these pamphlets to each restaurant in New York City.
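The per-cuisine counts above could be generalized into a single table driving those pamphlets: count violations within every cuisine category at once, then keep each category's top codes. A sketch of the pattern on a toy data frame standing in for df_filtered (with top_n(2) in place of the top_n(5) the real pamphlets would use):

```r
library(dplyr)

# Toy stand-in for df_filtered
toy <- tibble(
  new_label      = c(rep("South Asian", 6), rep("Desserts", 3)),
  violation.code = c("04L", "04L", "04L", "10F", "10F", "06D",
                     "10F", "10F", "08A")
)

# Top violations per cuisine category, one table for all categories
pamphlet_table <-
  toy %>%
  count(new_label, violation.code) %>%
  group_by(new_label) %>%
  top_n(n = 2, wt = n) %>%   # top_n(n = 5, wt = n) on the real data
  arrange(new_label, -n)

pamphlet_table
```

Splitting this table by new_label would give the contents of each cuisine type's pamphlet directly.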